Abstract

Recent studies have demonstrated advantages of information fusion
based on sparsity models for multimodal classification. Among several
sparsity models, tree-structured sparsity provides a flexible
framework for extraction of cross-correlated information from
different sources and for enforcing group sparsity at multiple
granularities. However, the existing algorithm only solves an
approximated version of the cost functional and the resulting solution
is not necessarily sparse at group levels. This paper reformulates the
tree-structured sparse model for the multimodal classification task.
An accelerated proximal algorithm is proposed to solve the
optimization problem, which is an efficient tool for feature-level
fusion among either homogeneous or heterogeneous sources of
information. In addition, a (fuzzy-set-theoretic) possibilistic scheme
is proposed to weight the available modalities, based on their
respective reliability, in a joint optimization problem for finding
the sparsity codes. This approach provides a general framework for
quality-based fusion that offers added robustness to several
sparsity-based multimodal classification algorithms. To demonstrate
their efficacy, the proposed methods are evaluated on three different
applications: multiview face recognition, multimodal face recognition,
and target classification.

Asok Ray
Pennsylvania State University
axr2@psu.edu

Kenneth W. Jenkins
Pennsylvania State University
jenkins@engr.psu.edu

1. Introduction

Information fusion using multiple sensors often results in better
situation awareness and decision making [7]. While the information
from a single sensor is generally localized and can be corrupted,
sensor fusion provides a framework to obtain sufficiently local
information from different perspectives, which is expected to be more
tolerant to the errors of individual sources. Moreover, the
cross-correlated information of (possibly heterogeneous) sources can
be used for context learning and enhanced machine perception [27].

Fusion algorithms are usually categorized into two levels: feature
fusion [22] and classifier fusion [20, 23]. Feature fusion methods
combine various features extracted from different sources into a
single feature set, which is then used for classification. On the
other hand, classifier fusion aggregates decisions from several
classifiers, where each classifier is built upon a separate source.
While classifier fusion has been well studied, feature fusion is a
relatively less-studied problem, mainly due to the incompatibility of
feature sets [21]. A naive way of feature fusion is to concatenate
features into a longer one [30], which may suffer from the curse of
dimensionality. Moreover, the concatenated feature does not contain
the cross-correlated information among the sources [22]. However, if
the above limitations are mitigated, feature fusion can potentially
outperform classifier fusion [12].

Recently, sparse representation has attracted the interest of many
researchers, both for reconstructive and discriminative tasks [26].
The underlying assumption is that if a dictionary is constructed with
the training samples of all classes, only a few atoms of the
dictionary, with the same label as the test data, should contribute to
reconstructing the test sample. Feature-level fusion using sparse
representation has also been recently introduced and is often referred
to as "multi-task learning", in which the general goal is to represent
samples jointly from several tasks/sources using different sparsity
priors [24, 25, 29]. In [18], a joint sparse model is introduced in
which multiple observations from the same class are simultaneously
represented by a few training samples. In other words, observations of
a single scenario from different modalities would generate the same
sparsity pattern of representing coefficients, which lies in a
low-dimensional subspace. Similarly, modalities are fused with a joint
sparsity model in [18] and [24] for target classification and
biometric recognition, respectively. The joint sparsity model relies
on the assumption that all the different sources share the same
sparsity patterns at atom level, which is not necessarily true and may
limit its applicability.

Another proposed solution is to group the relevant (correlated) tasks
together and seek common sparsity within the group only [9], or to
allow small collaboration between different groups [16]. A more
generalized approach is proposed in [11] for multi-task regression, in
which different tasks can be grouped in a tree-structured sparsity
model, providing flexibility in the fusion of different sources.
Although the formulation of tree-structured sparsity proposed in [11]
provides a framework to model different sparsity structures among
multiple tasks, the proposed optimization algorithm only solves an
approximation of the formulation. It therefore cannot enforce the
desired sparsity within different groups, and sparsity can only be
achieved within each task separately. Moreover, in all the discussed
multimodal fusion algorithms, including tree-structured sparsity,
different modalities are assumed to contribute equally to the
classification task. This can significantly limit the performance of
fusion algorithms in dealing with occasional perturbation or
malfunction of individual modalities. In [24], a quality measure based
on the joint sparse representation is introduced to quantify the
quality of the data from different modalities. However, this index is
measurable only after the sparse codes are obtained. Moreover, it
measures the sparsity level of the representing coefficients, which
does not necessarily reflect the quality of individual modalities.

The major contributions of the paper are as follows:

(i) Reformulation and efficient optimization of the tree-structured
sparsity for multimodal classification: A finite number of separated
problems are efficiently solved using the proximal algorithm [8] to
provide an exact solution to the tree-structured sparse
representation. The proposed learning facilitates feature-level fusion
of homogeneous/heterogeneous sources at multiple granularities.

(ii) Quality-based fusion: A (fuzzy-set-theoretic) possibilistic
approach [10, 15] is proposed to quantify the quality of different
modalities in joint optimization with the reconstruction task. The
proposed framework can be integrated with different sparsity priors
(e.g., joint sparsity or tree-structured sparsity) for quality-based
fusion. The proposed method places larger weights on those modalities
which have smaller reconstruction errors.

(iii) Improved performance for multimodal classification: The improved
performance and robustness of the proposed algorithms are illustrated
on three applications: multiview face recognition, multimodal face
recognition, and target classification.

The rest of the paper is organized as follows. In Section 2, after
briefly reviewing the joint sparsity model, multimodal tree-structured
sparsity is reformulated and solved using the proximal algorithm.
Section 3 proposes the quality-based fusion, which is followed by
comparative studies in Section 4 and conclusions in Section 5.


2.  Multimodal  sparse  representation 

This section reviews the joint sparse representation classifier
(JSRC) and reformulates the tree-structured sparsity model [11] as a
multimodal classifier. A proximal algorithm is then proposed to solve
the associated optimization.

2.1.  Joint  sparse  representation  classification 

Let $\mathcal{S} = \{1, \dots, S\}$ be a finite set of available
modalities used for multimodal classification and $C$ be the number of
different classes in the dataset. Let the training data consist of
$N = \sum_{c=1}^{C} N_c$ training samples from $S$ modalities, where
$N_c$ is the number of training samples in the $c$th class. Let $n^s$,
$s \in \mathcal{S}$, be the dimension of the feature vector for the
$s$th modality and $x^s_{c,j} \in \mathbb{R}^{n^s}$ denote the $j$th
sample of the $s$th modality belonging to the $c$th class, where
$j \in \{1, \dots, N_c\}$. In JSRC, $S$ dictionaries
$X^s = [X^s_1\ X^s_2\ \cdots\ X^s_C] \in \mathbb{R}^{n^s \times N}$,
$s \in \mathcal{S}$, are constructed from the (normalized) training
samples, where the class-wise sub-dictionary
$X^s_c = [x^s_{c,1}, x^s_{c,2}, \dots, x^s_{c,N_c}] \in \mathbb{R}^{n^s \times N_c}$
consists of samples from the $c$th class and $s$th modality.

Given the test samples $y^s \in \mathbb{R}^{n^s}$, $s \in \mathcal{S}$,
observed by $S$ different modalities from a single event, the goal is
to classify the event. In sparse representation classification, the
key assumption is that a test sample $y^s$ from the $c$th class lies
approximately within the subspace formed by the training samples of
the $c$th class and can be approximated (or reconstructed) from a
small number of training samples in $X^s_c$ [26]. That is, if the test
sample $y^s$ belongs to the $c$th class, it is represented as:

$$y^s = X^s \alpha^s + e, \qquad (1)$$

where $\alpha^s \in \mathbb{R}^{N}$ is a coefficient vector whose
entries are mostly 0's except for some of the entries associated with
the $c$th class, i.e.,
$\alpha^s = [0^T, \dots, 0^T, \alpha_c^T, 0^T, \dots, 0^T]^T$, and $e$
is a small error term due to imperfection of the samples.

In addition to the above assumption, JSRC enforces collaboration among
different modalities to make a joint decision, where the coefficient
vectors from different modalities are forced to have the same sparsity
pattern. That is, the same training samples from different modalities
are used to reconstruct the test data. The coefficient matrix
$A = [\alpha^1, \dots, \alpha^S] \in \mathbb{R}^{N \times S}$, where
$\alpha^s$ is the sparse coefficient vector for reconstructing $y^s$,
is recovered by solving the following $\ell_1/\ell_q$ joint
optimization problem with $q > 1$:

$$\operatorname*{argmin}_{A = [\alpha^1, \dots, \alpha^S]} f(A) + \lambda \|A\|_{\ell_1/\ell_q} \qquad (2)$$

In Eq. (2), $f(A) \triangleq \frac{1}{2} \sum_{s=1}^{S} \|y^s - X^s \alpha^s\|_{\ell_2}^2$
is the reconstruction error, the $\ell_1/\ell_q$ norm is defined as
$\|A\|_{\ell_1/\ell_q} = \sum_{j=1}^{N} \|a_j\|_{\ell_q}$, in which the
$a_j$'s are row vectors of $A$, and $\lambda > 0$ is a regularization
parameter. The number $q$ is usually set to 2, and thus the second
term in the cost function is referred to as the $\ell_1/\ell_2$
penalty term. The above optimization problem encourages sharing of
patterns across related observations so that the solution $A$ has a
common support at the column level [18]. The solution can be obtained
by using different optimization algorithms (e.g., the alternating
direction method of multipliers [28]). The proximal algorithm is used
in this paper [19].
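The effect of the mixed norm in Eq. (2) can be illustrated with a
minimal NumPy sketch. The function names `l1_lq_norm` and `jsrc_cost`
and the toy matrices are assumptions made for illustration only; they
are not part of the paper or of any toolbox it uses.

```python
import numpy as np

def l1_lq_norm(A, q=2):
    """Mixed l1/lq norm: sum over the rows of A of each row's l_q norm."""
    return float(np.sum(np.linalg.norm(A, ord=q, axis=1)))

def jsrc_cost(A, X_list, y_list, lam, q=2):
    """Cost of Eq. (2): 0.5 * sum_s ||y^s - X^s a^s||^2 + lam * ||A||_{l1/lq}."""
    fit = 0.5 * sum(np.sum((y - X @ A[:, s]) ** 2)
                    for s, (X, y) in enumerate(zip(X_list, y_list)))
    return fit + lam * l1_lq_norm(A, q)

# Two coefficient matrices with the same entries (rows = atoms, columns = modalities).
A_shared = np.array([[3.0, 4.0], [0.0, 0.0]])   # both modalities use atom 0
A_split = np.array([[3.0, 0.0], [0.0, 4.0]])    # each modality uses its own atom
print(l1_lq_norm(A_shared), l1_lq_norm(A_split))  # 5.0 7.0
```

With identical entries, the penalty is smaller when the modalities
select the same atom (row), which is why minimizing Eq. (2) favors a
shared support.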

Let $\delta_c(\alpha) \in \mathbb{R}^{N}$ be a vector indicator
function in which the rows corresponding to the $c$th class are
retained and the rest are set to zero. The test data is classified
using the class-specific reconstruction errors as:

$$c^* = \operatorname*{argmin}_{c} \sum_{s=1}^{S} \|y^s - X^s \delta_c(\alpha^{s*})\|_{\ell_2}^2, \qquad (3)$$

where the $\alpha^{s*}$'s are optimal solutions of Eq. (2).
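The decision rule of Eq. (3) can be sketched as follows. The helper
`classify`, the per-atom `labels` array, and the synthetic
dictionaries are hypothetical and only serve to show how the
class-specific residuals are accumulated over modalities.

```python
import numpy as np

def classify(A, X_list, y_list, labels):
    """Return the class with the smallest summed class-specific residual.

    A: (N, S) coefficient matrix; labels: length-N class label of each atom.
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    errors = []
    for c in classes:
        mask = labels == c                        # delta_c keeps rows of class c
        err = 0.0
        for s, (X, y) in enumerate(zip(X_list, y_list)):
            a_c = np.where(mask, A[:, s], 0.0)
            err += np.sum((y - X @ a_c) ** 2)
        errors.append(err)
    return classes[int(np.argmin(errors))], np.array(errors)

# Synthetic example: 4 atoms (2 per class), 2 modalities of dimension 5 and 3.
rng = np.random.default_rng(0)
X_list = [rng.standard_normal((5, 4)), rng.standard_normal((3, 4))]
labels = [0, 0, 1, 1]
A = np.zeros((4, 2))
A[2, :] = [1.0, 0.5]                              # only class-1 atoms are active
A[3, 0] = -0.7
y_list = [X @ A[:, s] for s, X in enumerate(X_list)]
pred, errors = classify(A, X_list, y_list, labels)
```

Because the test samples are built only from class-1 atoms, the
class-1 residual vanishes and `pred` is 1.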

2.2. Multimodal tree-structured sparse representation classification

As discussed in Section 1, although different sources are correlated,
the joint sparsity assumption of JSRC may be too restrictive for some
applications in which not all the different modalities are equally
correlated and stronger correlations between some groups of the
modalities may exist.

The tree-structured sparsity model provides a flexible framework to
enforce prior knowledge in grouping different modalities by encoding
them in a tree, where each leaf node represents an individual modality
and each internal node represents a grouping of its child nodes. This
arrangement allows modalities to be grouped at multiple granularities
[11]. Adopting the definition in [8], a tree-structured set of groups
of modalities $\mathcal{G} \subseteq 2^{\mathcal{S}} \setminus \emptyset$
is defined as a collection of subsets of the set of modalities
$\mathcal{S}$ such that $\bigcup_{g \in \mathcal{G}} g = \mathcal{S}$ and
$\forall g, \tilde{g} \in \mathcal{G},\ (g \cap \tilde{g} \neq \emptyset) \Rightarrow ((g \subseteq \tilde{g}) \vee (\tilde{g} \subseteq g))$.
It is assumed here that $\mathcal{G}$ is ordered according to the
relation $\preceq$, which is defined as
$(g \preceq \tilde{g}) \Rightarrow ((g \subseteq \tilde{g}) \vee (g \cap \tilde{g} = \emptyset))$.
If prior knowledge about the grouping of modalities is not available,
then hierarchical clustering algorithms could be used to find the tree
structure [11].
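The two conditions on the groups (coverage of all modalities, and any
two groups being either nested or disjoint) are easy to verify
programmatically. The following is a small sketch with hypothetical
helper names; sorting by group size is one simple way, assumed here,
to obtain an order in which every group appears after its subsets.

```python
from itertools import combinations

def is_tree_structured(groups, n_modalities):
    """True if the groups cover {1..S} and any two are nested or disjoint."""
    sets = [frozenset(g) for g in groups]
    if any(len(g) == 0 for g in sets):
        return False
    if frozenset().union(*sets) != frozenset(range(1, n_modalities + 1)):
        return False
    return all(not (g & h) or g <= h or h <= g
               for g, h in combinations(sets, 2))

def order_groups(groups):
    """Sort so that every group comes after all of its subsets (leaves first)."""
    return sorted((frozenset(g) for g in groups), key=len)

groups = [{1}, {2}, {1, 2}, {3}, {4}, {3, 4}, {1, 2, 3, 4}]
ok = is_tree_structured(groups, 4)               # nested-or-disjoint and covering
bad = is_tree_structured([{1, 2}, {2, 3}], 3)    # {1,2} and {2,3} overlap partially
ordered = order_groups(groups)
```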

Given a tree-structured collection $\mathcal{G}$ of groups, the
proposed multimodal tree-structured sparse representation
classification (MTSRC) is formulated as:

$$\operatorname*{argmin}_{A = [\alpha^1, \dots, \alpha^S]} f(A) + \lambda \Omega(A) \qquad (4)$$

where $f(A)$ is defined the same as in Eq. (2), and the
tree-structured sparse model is defined as:

$$\Omega(A) \triangleq \sum_{j=1}^{N} \sum_{g \in \mathcal{G}} \omega_g \|a_{jg}\|_{\ell_2}. \qquad (5)$$

In Eq. (5), $\omega_g$ is a positive weight for group $g$ and $a_{jg}$
is a $(1 \times S)$ row vector whose coordinates are equal to the
$j$th row of $A$ for indices in the group $g$, and 0 otherwise.


The above optimization problem allows sharing of patterns across
related groups of modalities. Thus the optimal solution $A^*$ has a
common support at the group level, and the resulting sparsity is
dependent on the relative weights $\omega_g$ of different groups [11].
Having obtained $A^*$, the test samples are classified using (3). The
tree-structured sparsity provides a flexible framework to enforce
different sparsity priors. For example, if $\mathcal{G}$ consists of
only one group, containing all modalities, then (4) reduces to that of
JSRC in (2). In another example, where $\mathcal{G}$ consists of only
singleton sets of individual modalities, no common sparsity pattern is
sought among different modalities and the optimization (4) reduces to
$S$ separate $\ell_1$ optimization problems.
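The two special cases mentioned above can be checked numerically with
a short sketch of the penalty in Eq. (5). The function `tree_penalty`
and the 0-based column indexing of groups are illustrative
assumptions.

```python
import numpy as np

def tree_penalty(A, groups, weights):
    """Omega(A) = sum_j sum_g w_g * ||A[j, g]||_2 (groups: 0-based column indices)."""
    total = 0.0
    for g, w in zip(groups, weights):
        total += w * np.sum(np.linalg.norm(A[:, list(g)], axis=1))
    return float(total)

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))                  # 6 atoms, 3 modalities

# One group holding all modalities: the l1/l2 penalty used by JSRC.
joint = tree_penalty(A, [(0, 1, 2)], [1.0])
# Singleton groups only: a plain l1 penalty, separable over modalities.
separate = tree_penalty(A, [(0,), (1,), (2,)], [1.0, 1.0, 1.0])
```

Here `joint` equals the sum of the row-wise l2 norms of `A`, and
`separate` equals the sum of the absolute values of its entries.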

2.3.  Optimization  algorithm 

As discussed in Section 1, the optimization procedure proposed in [11]
for tree-structured sparsity only solves an approximated version of
the optimization problem (4) and, therefore, does not necessarily
result in a solution with the desired group sparsity. In this section,
an accelerated proximal gradient method [19] is used to solve (4), in
which the optimal solution is obtained by solving a finite number of
tractable optimization problems without approximating the cost
function. Let the initial value of $A$, which can be chosen as an
arbitrary sparse matrix, be zero. Then, at the $k$th iteration, the
proposed accelerated proximal gradient is as follows [19]:

$$B^{k+1} = A^k + \rho^k (A^k - A^{k-1}),$$
$$A^{k+1} = \operatorname{prox}_{\lambda t^k \Omega}\left(B^{k+1} - t^k \nabla f(B^{k+1})\right), \qquad (6)$$

where $t^k$ is the step size at time step $k$, which can be set as a
constant or be updated using a line search algorithm [19]; $A^k$ is
the estimate of the optimal solution $A$ at time step $k$; $B^{k+1}$
is the extrapolated point; the extrapolation parameter $\rho^k$ could
be chosen as in [19]; and the associated proximal optimization problem
is defined as:

$$\operatorname{prox}_{\lambda t^k \Omega}(V) = \operatorname*{argmin}_{U \in \mathbb{R}^{N \times S}} \Omega(U) + \frac{1}{2 \lambda t^k} \|U - V\|_F^2, \qquad (7)$$

where $\|\cdot\|_F$ is the Frobenius norm. Using Eq. (5), the proximal
optimization problem is reformulated as:

$$\operatorname{prox}_{\lambda t^k \Omega}(V) = \operatorname*{argmin}_{U \in \mathbb{R}^{N \times S}} \sum_{j=1}^{N} \left( \sum_{g \in \mathcal{G}} \omega_g \|u_{jg}\|_{\ell_2} + \frac{1}{2\beta} \|u_j - v_j\|_{\ell_2}^2 \right), \qquad (8)$$


where $\beta = \lambda t^k$; $u_{jg}$ is defined similarly to $a_{jg}$
in Eq. (5); and $u_j$ and $v_j$ are the $j$th rows of $U$ and $V$,
respectively. Consequently, the solution of (8) is obtained by $N$
separate optimizations on $S$-dimensional vectors. Since the groups
are defined to be ordered, each of the optimization problems can be
solved in a single iteration using the dual form [8]. Therefore, the
proximal step of the tree-structured sparsity can be solved with the
same computational cost as that of joint sparsity. Algorithm 1, which
is a direct extension of the optimization algorithm in [8], solves the
proximal step. It should be noted that the computational complexity of
the optimization algorithm grows linearly as the number of training
samples increases. One can potentially learn dictionaries with
(significantly) fewer atoms using dictionary learning algorithms [13].
This paper uses the Sparse Modeling Software [8] to solve the proximal
step.

Algorithm 1 Algorithm to solve the proximal optimization step
(Eq. (8)) of the accelerated proximal gradient method (Eq. (6))
corresponding to the MTSRC optimization problem (Eq. (4))

Input: $V \in \mathbb{R}^{N \times S}$, ordered set of groups $\mathcal{G}$, weights $\omega_g$ for each group $g \in \mathcal{G}$, and scalar $\beta$.
Output: $U \in \mathbb{R}^{N \times S}$
1: for $j = 1, \dots, N$ do
2:   Let $\eta^{g_1} = \dots = \eta^{g_{|\mathcal{G}|}} = 0$ and $u_j = v_j$.
3:   for $g = g_1, g_2, \dots \in \mathcal{G}$ do
4:     $u_j = v_j - \sum_{h \neq g} \eta^{h}$
5:     $\eta^{g} = u_{jg}$ if $\|u_{jg}\|_{\ell_2} \le \beta \omega_g$; otherwise $\eta^{g} = \beta \omega_g \, u_{jg} / \|u_{jg}\|_{\ell_2}$
6:   end for
7:   $u_j = v_j - \sum_{g \in \mathcal{G}} \eta^{g}$
8: end for
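As an illustration of the proximal step and of the accelerated
iteration, the following NumPy sketch evaluates the tree-structured
proximal operator by successive group soft-thresholding (subsets
before supersets), which is a standard single-pass evaluation for
nested-or-disjoint groups with l2 norms. It is not the Sparse Modeling
Software implementation. The function names, the constant step size
1/L, and the extrapolation weight k/(k+3) are assumptions made for
this example.

```python
import numpy as np

def prox_tree(V, groups, weights, beta):
    """Row-wise prox of beta * sum_g w_g ||u_g||_2 for tree-structured groups.

    groups: 0-based column index lists, ordered so subsets precede supersets.
    """
    U = np.array(V, dtype=float, copy=True)
    for g, w in zip(groups, weights):
        idx = list(g)
        norms = np.linalg.norm(U[:, idx], axis=1, keepdims=True)
        scale = np.maximum(0.0, 1.0 - beta * w / np.maximum(norms, 1e-12))
        U[:, idx] *= scale                        # group soft-thresholding
    return U

def apg(X_list, y_list, groups, weights, lam, n_iter=200):
    """Accelerated proximal gradient for 0.5*sum_s ||y^s - X^s a^s||^2 + lam*Omega(A)."""
    N, S = X_list[0].shape[1], len(X_list)
    # Constant step 1/L, with L the largest squared spectral norm (assumed choice).
    t = 1.0 / max(np.linalg.norm(X, 2) ** 2 for X in X_list)
    A_prev = A = np.zeros((N, S))
    for k in range(n_iter):
        B = A + (k / (k + 3.0)) * (A - A_prev)    # extrapolation (assumed weight)
        grad = np.column_stack([X.T @ (X @ B[:, s] - y)
                                for s, (X, y) in enumerate(zip(X_list, y_list))])
        A_prev, A = A, prox_tree(B - t * grad, groups, weights, lam * t)
    return A

groups, weights = [[0], [1], [0, 1]], [1.0, 1.0, 1.0]
V = np.array([[3.0, 4.0]])
U = prox_tree(V, groups, weights, beta=1.0)
```

With identity dictionaries the minimizer of the full problem is the
proximal operator itself, which gives a simple sanity check for `apg`.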


3.  Weighted  scheme  for  quality-based  fusion 

In most of the sparsity-based multimodal classification algorithms,
including JSRC and MTSRC, it is inherently assumed that available
modalities contribute equally. This may significantly limit the
performance in dealing with occasional perturbation or malfunction of
individual sources. Ideally, a fusion scheme should adaptively weight
the modalities based on their reliabilities. In [24], a quality
measure is introduced for JSRC, where a sparsity concentration index
is calculated to quantify the quality of modalities. The main
limitation of this approach, however, is that the index is obtained
only after the sparse codes are calculated, and a weak modality may
hurt the performance of other modalities due to the enforced sparsity
priors. Moreover, the index is defined based on the sparsity levels
and does not necessarily reflect the quality of each modality. This
paper proposes to find the adaptive quality of each modality and the
sparse codes jointly in a single optimization problem.

Let $\mu^s$ be the quality weight for the $s$th modality. A weighted
scheme for multimodal reconstruction, with similar structure to
Eq. (4), is proposed as follows:

$$\operatorname*{argmin}_{A = [\alpha^1, \dots, \alpha^S],\ \mu^s} \sum_{s=1}^{S} \frac{(\mu^s)^m}{2} \|y^s - X^s \alpha^s\|_{\ell_2}^2 + \Psi(A), \qquad (9)$$

with the constraint $\mu^s \ge 0$, $\forall s \in \mathcal{S}$, where
the exponent $m \in (1, \infty)$ is a fuzzifier parameter, similar to
the formulation of fuzzy c-means clustering [3]; and $\Psi(A)$
enforces the desired sparsity priors within the modalities (e.g., the
$\ell_1/\ell_2$ constraint in JSRC or the tree-structured sparsity
prior of MTSRC).

Another constraint on $\mu^s$ is necessary to avoid a degenerate
solution of Eq. (9). A constraint such as $\sum_{s=1}^{S} \mu^s = 1$
is apparently feasible; however, since $m > 1$ in Eq. (9), the larger
weight of a modality compared to those of other modalities may
effectively increase this weight close to 1 while forcing the rest of
the weights toward 0. To alleviate this problem, a "possibility"-like
constraint, similar to possibilistic fuzzy c-means clustering
[1, 15], is proposed to allow the weights of different modalities to
be specified independently. The proposed composite optimization
problem to achieve quality-based multimodal fusion is posed as:

$$\operatorname*{argmin}_{A = [\alpha^1, \dots, \alpha^S],\ \mu^s} \left( \sum_{s=1}^{S} \frac{(\mu^s)^m}{2} \|y^s - X^s \alpha^s\|_{\ell_2}^2 + \Psi(A) + \sum_{s=1}^{S} \frac{\lambda_{\mu^s}}{2} (1 - \mu^s)^m \right), \quad \mu^s \ge 0,\ \forall s \in \mathcal{S}, \qquad (10)$$

where the $\lambda_{\mu^s}$ are the regularization parameters for
$s \in \mathcal{S}$. After finding the optimal $(\mu^s)^*$ and $A^*$,
the test samples are classified using the weighted reconstruction
errors, i.e.,

$$c^* = \operatorname*{argmin}_{c} \sum_{s=1}^{S} \left((\mu^s)^*\right)^m \|y^s - X^s \delta_c(\alpha^{s*})\|_{\ell_2}^2. \qquad (11)$$

The optimization problem in (10) is not jointly convex in $A$ and
$\mu^s$. A sub-optimal solution can be obtained by alternating between
the updates of $A$ and $\mu^s$, minimizing over one variable while
keeping the other variable fixed. The accelerated proximal gradient
algorithm in (6) with
$f(A) = \frac{1}{2} \sum_{s=1}^{S} (\mu^s)^m \|y^s - X^s \alpha^s\|_{\ell_2}^2$
is used to optimize over $A$, and the optimal $\mu^s$ at each
iteration of the alternating optimization is found in closed form
[15] as:

$$\mu^s = \frac{1}{1 + \left( \dfrac{\|y^s - X^s \alpha^s\|_{\ell_2}^2}{\lambda_{\mu^s}} \right)^{\frac{1}{m-1}}}, \quad s \in \mathcal{S}. \qquad (12)$$

The regularization parameters $\lambda_{\mu^s}$ need to be chosen
based on the desired bandwidth of the possibility distribution for
each modality. In this paper, optimization over $A$ is first performed
without the weighting scheme to find an initial estimate of $A$. The
following definition, similar to possibilistic clustering [1], is then
used to determine and fix $\lambda_{\mu^s}$:

$$\lambda_{\mu^s} = \|y^s - X^s \alpha^s\|_{\ell_2}^2, \qquad (13)$$

resulting in all $\mu^s$ being 0.5 initially. It is observed that only
a few iterations are required for the algorithm to converge. In this
paper, the number of alternations and the fuzzifier $m$ are set to 10
and 2, respectively. Algorithm 2 summarizes the proposed quality-based
multimodal fusion method. As discussed, this scheme can be used with
different sparsity priors as long as the optimal $A$ can be found
efficiently [19].
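The closed-form weight update can be sketched in a few lines. The
helper `update_weights` is hypothetical, and the sketch assumes that
the residual term and the possibilistic penalty carry the same
scaling, so that each weight minimizes
$\frac{1}{2}\mu^m r + \frac{1}{2}\lambda_{\mu}(1-\mu)^m$ for a squared
residual $r$.

```python
import numpy as np

def update_weights(residuals_sq, lam_mu, m=2.0):
    """Possibilistic weights: 1 / (1 + (r / lam)^(1 / (m - 1))) per modality."""
    r = np.asarray(residuals_sq, dtype=float)
    lam = np.asarray(lam_mu, dtype=float)
    return 1.0 / (1.0 + (r / lam) ** (1.0 / (m - 1.0)))

# A modality whose residual equals its regularization parameter gets weight 0.5;
# a three times larger residual lowers the weight to 0.25 when m = 2.
mu = update_weights([1.0, 3.0], [1.0, 1.0], m=2.0)
print(mu)  # [0.5  0.25]
```

Each weight depends only on its own modality, which is the practical
difference from a sum-to-one constraint.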


Algorithm 2 Quality-based multimodal fusion.

Input: Initial coefficient matrix $A$, dictionary $X^s$ and test sample $y^s$ of the $s$th modality, $s \in \{1, \dots, S\}$, and number of iterations $T$.
Output: Coefficient matrix $A$ and weights $\mu^s$ as a solution to Eq. (10).
1: Set $\lambda_{\mu^s}$ using Eq. (13)
2: for $k = 1, \dots, T$ do
3:   Update weights $\mu^s$ using Eq. (12)
4:   Update $A$ by solving Eq. (9) with fixed $\mu^s$.
5: end for
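The alternation in Algorithm 2 can be sketched end to end as below.
This is a minimal illustration, assuming the joint (l1/l2) sparsity
prior, a plain (non-accelerated) proximal gradient solver for the
coefficient update, and synthetic data; all function names are
hypothetical and the sketch does not reproduce the paper's
experiments.

```python
import numpy as np

def row_soft_threshold(V, tau):
    """Prox of tau * sum_j ||v_j||_2 (row-wise group soft-thresholding)."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V * np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))

def solve_A(X_list, y_list, mu, lam, m=2.0, n_iter=300):
    """Proximal gradient on 0.5*sum_s mu_s^m ||y^s - X^s a^s||^2 + lam*||A||_{l1/l2}."""
    w = np.asarray(mu, dtype=float) ** m
    t = 1.0 / max(ws * np.linalg.norm(X, 2) ** 2 for ws, X in zip(w, X_list))
    A = np.zeros((X_list[0].shape[1], len(X_list)))
    for _ in range(n_iter):
        grad = np.column_stack([ws * (X.T @ (X @ A[:, s] - y))
                                for s, (ws, X, y) in enumerate(zip(w, X_list, y_list))])
        A = row_soft_threshold(A - t * grad, lam * t)
    return A

def residuals(A, X_list, y_list):
    return np.array([np.sum((y - X @ A[:, s]) ** 2)
                     for s, (X, y) in enumerate(zip(X_list, y_list))])

def quality_fusion(X_list, y_list, lam, m=2.0, T=10):
    mu = np.ones(len(X_list))
    A = solve_A(X_list, y_list, mu, lam, m)                    # unweighted start
    lam_mu = np.maximum(residuals(A, X_list, y_list), 1e-12)   # fixed once
    for _ in range(T):
        r = residuals(A, X_list, y_list)
        mu = 1.0 / (1.0 + (r / lam_mu) ** (1.0 / (m - 1.0)))   # weight update
        A = solve_A(X_list, y_list, mu, lam, m)                # coefficient update
    return A, mu

rng = np.random.default_rng(0)
X_list = [rng.standard_normal((6, 8)) for _ in range(2)]
y_list = [X_list[0][:, 3].copy(), rng.standard_normal(6)]      # second modality is noise
A, mu = quality_fusion(X_list, y_list, lam=0.1)
```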


4.  Results  and  discussion 

In this section we present the results for the proposed MTSRC and
weighted scheme in three different applications: multiview face
recognition, multimodal face recognition, and multimodal target
classification. For MTSRC, the group weights $\omega_g$ are selected
using a combination of a priori information/assumption and cross
validation. We assume equal weights for groups of the same size, which
reduces the number of tuning parameters. The relative weights between
groups of different sizes, however, are not fixed and are selected
using cross-validation over a finite set of candidate values.
Hierarchical clustering can also be used to tune the weights [11]. It
is observed that bigger groups are usually assigned bigger weights in
the studied applications. Thus MTSRC intuitively enforces
collaboration among all the modalities and yet provides flexibility
(compared to JSRC) by allowing collaborations to be formed at lower
granularities as well. It is observed in all the studied applications
that MTSRC performs similarly when the weights are varied without
being reordered.

For the weighted scheme, JSRC and MTSRC are equipped with the modality
weights and the resulting algorithms are denoted as JSRC-W and
MTSRC-W, respectively. The performance of the proposed feature-level
fusion algorithms is compared with that of several state-of-the-art
decision-level and feature-level fusion algorithms. For decision-level
fusion, outputs of independent classifiers, each trained on a separate
modality, are aggregated by adding the corresponding probability
outputs of each modality, which is denoted as Sum. For this purpose,
SVM [4], sparse representation classifier (SRC) [26], and sparse
logistic regression (SLR) [14] are used. The proposed methods are also
compared with existing feature-level fusion methods that include the
holistic sparse representation classifier (HSRC) [30], JSRC [18], the
joint dynamic sparse representation classifier (JDSRC) [30], and
relaxed collaborative representation (RCR) [29].
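The Sum rule used for the decision-level baselines amounts to adding
per-modality class probabilities and taking the arg max. A minimal
sketch, with a hypothetical function name and toy probabilities:

```python
import numpy as np

def sum_rule(prob_list):
    """prob_list: one (n_samples, n_classes) probability array per modality."""
    total = np.sum(np.stack(prob_list, axis=0), axis=0)
    return np.argmax(total, axis=1)

p_modality_1 = np.array([[0.6, 0.4], [0.3, 0.7]])
p_modality_2 = np.array([[0.1, 0.9], [0.4, 0.6]])
fused = sum_rule([p_modality_1, p_modality_2])
print(fused)  # [1 1]
```

In the first sample the two classifiers disagree, and the more
confident one decides the fused label.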

4.1.  Multiview  face  recognition 

In this experiment, we evaluate the performance of the proposed MTSRC
for multiview face recognition using the UMIST face database, which
consists of 564 cropped images of 20 individuals with mixed race and
gender [6]. Poses of each individual are sorted from profile to
frontal views and then divided into $S$ view-ranges with an equal
number of images in each view-range. This facilitates multiview face
recognition using the UMIST database. The performance of the
algorithms is tested using different numbers of view-ranges,
$S \in \{2, 3, 4\}$. It should be noted that the environment is
unconstrained and captured faces may have pose variations within each
view-range. Different poses for one of the individuals in the UMIST
database are shown in Fig. 1, where the images are divided into four
view-ranges, shown in four rows. Due to the limited number of training
samples, a single dictionary is constructed by randomly selecting one
(normalized) image per view for all the individuals in the database,
which is shared among different view-ranges. The rest of the images
are used as the test data.

Figure 1: Different poses for one individual in the UMIST database
that is divided into four different view-ranges shown by four rows.

Figure 2: Representation coefficients generated by JSRC, JDSRC, and
MTSRC on test observations that correspond to 4 different views on the
UMIST database. JSRC enforces joint sparsity among all views and
requires the same sparsity pattern at atom level. JDSRC allows
contributions from different training samples to approximate a set of
test observations and requires the same sparsity pattern at class
level. MTSRC enforces joint sparsity only when relevant.

As observations from nearby views are more likely to share similar
poses and be correlated, the tree-structured sparsity of MTSRC is
chosen to enforce group sparsity within related views. For this
purpose, the tree-structured sets of groups using 2, 3, or 4
view-ranges are selected to be {{1}, {2}, {1, 2}},
{{1}, {2}, {3}, {1, 2, 3}}, and
{{1}, {2}, {1, 2}, {3}, {4}, {3, 4}, {1, 2, 3, 4}}, respectively.
Fig. 2 shows the representation coefficients generated by JSRC, JDSRC,
and MTSRC on a test scenario where four different view-ranges are
used. As expected, JSRC enforces joint sparsity among all different
views at atom level, which may be too restrictive due to the
relatively weaker correlation between frontal and profile views. JDSRC
relaxes the joint sparsity constraint at atom level and allows
contributions from different training samples to approximate one set
of the test observations, but still requires a joint sparsity pattern
at class level. As shown, the proposed MTSRC enforces joint sparsity
within relevant views and also among all modalities, and has the most
flexible structure for multimodal classification. The results of
10-fold cross validation using different sparsity priors are
summarized in Table 1. As seen, the MTSRC method achieves the best
performance. Since the quality of the different view-ranges is
similar, JSRC-W and MTSRC-W result in performance similar to that of
JSRC and MTSRC, respectively, and are therefore omitted here.

Table 1: Multiview classification results obtained on the UMIST
database.

Method        2 Views   3 Views   4 Views   Avg.
HSRC [30]     84.37     97.80     99.91     94.03
JSRC [18]     87.77     99.51     99.91     95.73
JDSRC [30]    86.52     98.96     99.91     95.13
MTSRC         88.42     99.63     100.00    96.02

Figure 3: A test image and its occlusion in the AR dataset.

Table 2: Correct classification rates obtained using individual
modalities in AR database. Modalities include left and right
periocular, nose, mouth, and face.

Method   L periocular   R periocular   Nose    Mouth   Face
SVM      71.00          74.00          44.00   44.29   86.86
SRC      79.29          78.29          63.43   64.14   93.71

Table 3: Multimodal classification results obtained on the AR
database.

Method          CCR      Method          CCR
SVM-Sum [24]    92.57    SLR-Sum [24]    77.9
JDSRC [30]      93.14    RCR [29]        94.00
JSRC [18]       95.57    JSRC-W          96.43
MTSRC           97.14    MTSRC-W         97.14

4.2.  Multimodal  face  recognition:  AR  database 

Figure 4: CMCs obtained using multimodal classification
algorithms on AR database with random occlusion. (Plot title:
CMCs for multimodal fusion.)

In this set of experiments, the performance of the proposed
algorithms is evaluated on the AR database (Figure 3), which
consists of faces under different poses, illumination, and
expression conditions, captured in two sessions [17]. A set of
100 users is used, each represented by seven images from the
first session as training samples and seven images from the
second session as test samples. A small, randomly selected
portion of the training samples, 30 out of 700, is used as a
validation set for optimizing the design parameters of the
classifiers. Fusion is performed on five modalities, similar to
the setup in [24], including the left and right periocular, nose,
and mouth regions, as well as the whole face. After resizing the
images, intensity values are used as features for all modalities.
Results of classification with individual modalities using the
SVM and SRC algorithms are shown in Table 2. As expected, the
whole face is the strongest modality for the identification task,
followed by the left and right eyes. For MTSRC, the groups are
chosen to be
$\mathcal{G} = \{g_1, g_2, g_3, g_4, g_5, g_6, g_7\} = \{\{1\}, \{2\}, \{1,2\}, \{3\}, \{4\}, \{5\}, \{1,2,3,4,5\}\}$,
where 1 and 2 represent the left and right periocular regions and
3, 4, and 5 represent the other modalities. Weights are selected
using an approach similar to that discussed in Section 4.3 to
encourage group sparsity among all modalities and also a joint
representation for the left and right periocular regions at a
lower granularity. The performance of the proposed algorithms is
compared with that of several competitive methods in Table 3.
Fig. 4 shows the corresponding cumulative matched score curves
(CMCs). The CMC is a performance measure, similar to the ROC,
that was originally proposed for biometric recognition systems
[5]. As shown, the tree-structured sparsity-based algorithms
achieve the best performance, with a correct classification rate
(CCR) of 97.14%.
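
As an illustration of how a CMC relates to the CCR, the following minimal sketch computes a cumulative match curve from a matrix of matching scores. The function name, the toy scores, and the convention that a higher score means a better match are assumptions made for this example; they are not taken from the evaluation code used in the experiments.

```python
def cmc_curve(scores, gallery_labels, probe_labels):
    """Return the cumulative match curve as a list of identification rates.

    scores[i][j] is a (hypothetical) similarity between probe i and
    enrolled class j, where a higher value means a better match.
    Entry k-1 of the result is the fraction of probes whose true class
    appears among the k best-scoring classes.
    """
    n_classes = len(gallery_labels)
    hits_at_rank = [0] * n_classes
    for row, true_label in zip(scores, probe_labels):
        # Sort class indices from best to worst score for this probe.
        order = sorted(range(n_classes), key=lambda j: row[j], reverse=True)
        rank = next(r for r, j in enumerate(order)
                    if gallery_labels[j] == true_label)
        hits_at_rank[rank] += 1
    # Accumulate the per-rank hit counts into a cumulative rate.
    curve, running = [], 0
    for hits in hits_at_rank:
        running += hits
        curve.append(running / len(probe_labels))
    return curve


# Toy example: three probes, three enrolled classes.
scores = [
    [0.9, 0.1, 0.0],  # true class "a" is ranked first
    [0.6, 0.3, 0.1],  # true class "b" is ranked second
    [0.2, 0.5, 0.3],  # true class "c" is ranked second
]
curve = cmc_curve(scores, ["a", "b", "c"], ["a", "b", "c"])
print(curve)
```

In this toy case the rank-1 value is 1/3 and the curve reaches 1 at rank 2. The rank-1 value of a CMC is the correct classification rate.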

To compare the robustness of the different algorithms, each
test image is occluded by a randomly chosen block (see
Fig. 3). Fig. 5 shows the CMCs generated when the size
of the occluding block is 15. As seen, the proposed
tree-structured algorithms are the top-performing algorithms.
Fig. 6 compares the CCR of the different algorithms under block
occlusion of increasing size. MTSRC-W has the most robust
performance. It is also observed that the weighted scheme
significantly improves the performance of JSRC.
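
The occlusion procedure can be sketched as below. This is only an illustration under stated assumptions: the block is taken to be a square whose side length is the reported size in pixels, its position is drawn uniformly inside the image, and the occluded pixels are set to a constant fill value. The text does not specify these details, and the image size used here is hypothetical.

```python
import random


def occlude_with_block(image, block_size, rng, fill=0):
    """Return a copy of `image` with one square block overwritten by `fill`.

    `image` is a list of rows of pixel intensities. The block position is
    drawn uniformly so that the block lies fully inside the image.
    The constant fill value is an assumption of this sketch.
    """
    height, width = len(image), len(image[0])
    if block_size > height or block_size > width:
        raise ValueError("block does not fit inside the image")
    top = rng.randint(0, height - block_size)    # inclusive bounds
    left = rng.randint(0, width - block_size)
    occluded = [row[:] for row in image]         # leave the input untouched
    for r in range(top, top + block_size):
        for c in range(left, left + block_size):
            occluded[r][c] = fill
    return occluded


rng = random.Random(0)
image = [[255] * 40 for _ in range(30)]  # hypothetical 30x40 all-white image
occluded = occlude_with_block(image, block_size=15, rng=rng)
changed = sum(pixel == 0 for row in occluded for pixel in row)
print(changed)  # 15 * 15 = 225 pixels overwritten
```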

Figure 5: CMCs obtained using different multimodal
classification algorithms on the AR database with random
occlusion.

Since the proposed weighted scheme is solved using alternating
minimization, a set of experiments is performed to test the
sensitivity of its performance to different initializations of
the modality weights and the regularization parameters
$\lambda_{\mu^s}$. In each experiment, all initial weights are set
to a different value. Also, all $\lambda_{\mu^s}$ are set to a
different value, instead of being set by Eq. (13). The sparse
coefficients $A$ are initialized to zero in all the experiments.
We observed similar results for relatively large variations in
the initial weights and regularization parameters. The
performance of the weighted scheme degrades if the
regularization parameters are set too small. On the other hand,
its performance approaches that of the non-weighted scheme for
large values of the regularization parameters, as expected from
cost function (10). It is also observed that setting the
regularization parameters by Eq. (13), with all the weights
initialized to $1/S$, consistently works well.

4.3.  Multimodal  target  classification 

This section presents the results of target classification
on a field dataset consisting of measurements obtained from
a passive infrared (PIR) sensor and three seismic sensors of an
unattended ground sensor system, as discussed in [2]. Symbolic
dynamic filtering is used for feature extraction from the
time-series data [2]. The subset of data used here consists
of two days of data. Day 1 includes 47 human targets and 35
animal-led-by-human targets, while the corresponding numbers
for Day 2 are 32 and 34, respectively. A two-way
cross-validation is used to assess the performance of the
classification algorithms, i.e., Day 1 data is used for training
and Day 2 data for testing, and vice versa.

For MTSRC, the tree-structured set of groups is selected to be
$\mathcal{G} = \{g_1, g_2, g_3, g_4, g_5, g_6\} = \{\{1\}, \{2\}, \{3\}, \{1,2,3\}, \{4\}, \{1,2,3,4\}\}$,
where 1, 2, and 3 refer to the seismic channels and 4 refers to
the PIR channel. Table 4 summarizes the average human detection
rate (HDR), human false alarm rate (HFAR), and misclassification
rate (MR) obtained using the different multimodal
classification algorithms. As seen, the proposed JSRC-W
and MTSRC-W algorithms result in the best HDR and
HFAR and, consequently, the best overall performance.
Moreover, if the different modalities are weighted equally,
MTSRC achieves the best performance.
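
A set of groups is commonly called tree-structured when any two groups are either disjoint or nested. The short sketch below checks this property for the group sets used in the experiments. The helper name is hypothetical, and the check reflects the usual definition from the structured-sparsity literature rather than code from this work.

```python
from itertools import combinations


def is_tree_structured(groups):
    """True if every pair of groups is either disjoint or nested."""
    sets = [frozenset(g) for g in groups]
    return all(
        a.isdisjoint(b) or a <= b or b <= a
        for a, b in combinations(sets, 2)
    )


# Groups from the target classification setup: 1-3 seismic, 4 PIR.
target_groups = [{1}, {2}, {3}, {1, 2, 3}, {4}, {1, 2, 3, 4}]
print(is_tree_structured(target_groups))

# Groups from the AR face setup: 1-2 periocular, 3-5 other modalities.
ar_groups = [{1}, {2}, {1, 2}, {3}, {4}, {5}, {1, 2, 3, 4, 5}]
print(is_tree_structured(ar_groups))

# A counterexample: {1, 2} and {2, 3} overlap without nesting.
print(is_tree_structured([{1, 2}, {2, 3}]))
```

Both experimental group sets pass the check, while the overlapping counterexample does not.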


Figure 6: Correct classification rates of multimodal
classification algorithms with block occlusion of different sizes.

Table 4: Results of target classification obtained by different
multimodal classification algorithms by fusing data from 1 PIR
and 3 seismic sensors. HDR: Human Detection Rate, HFAR:
Human False Alarm Rate, MR: Misclassification Rate.

Algorithm   HDR    HFAR   MR
SVM-Sum     0.94   0.15   10.61%
HSRC        0.96   0.10    6.76%
JDSRC       0.97   0.09    8.11%
RCR         0.94   0.12    8.78%
JSRC        1.00   0.12    5.41%
JSRC-W      1.00   0.07    3.38%
MTSRC       1.00   0.09    4.05%
MTSRC-W     1.00   0.07    3.38%
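
For reference, the three rates reported in Table 4 can be computed from true and predicted labels roughly as follows. The definitions used here are assumptions based on the usual meaning of the terms: HDR is the fraction of human targets labeled as human, HFAR is the fraction of non-human targets labeled as human, and MR is the fraction of all samples that are misclassified. The function name and the toy labels are illustrative only.

```python
def detection_metrics(true_labels, predicted_labels, positive="human"):
    """Return (HDR, HFAR, MR) under the assumed definitions above.

    Assumes at least one positive and one negative sample are present.
    """
    pairs = list(zip(true_labels, predicted_labels))
    on_positives = [p for t, p in pairs if t == positive]
    on_negatives = [p for t, p in pairs if t != positive]
    hdr = sum(p == positive for p in on_positives) / len(on_positives)
    hfar = sum(p == positive for p in on_negatives) / len(on_negatives)
    mr = sum(t != p for t, p in pairs) / len(pairs)
    return hdr, hfar, mr


# Toy example: 4 human and 4 animal-led-by-human targets.
true_labels = ["human"] * 4 + ["animal"] * 4
predicted = ["human"] * 4 + ["human", "animal", "animal", "animal"]
hdr, hfar, mr = detection_metrics(true_labels, predicted)
print(hdr, hfar, mr)  # 1.0 0.25 0.125
```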

Figure 7: Correct classification rates of multimodal
classification algorithms in dealing with random noise of varying
variance that is added to the seismic 1 channel.

Figure 8: Averaged weights for each modality obtained
by the MTSRC-W algorithm when test samples from the second
modality, seismic sensor 1, are perturbed with random
noise. As the noise level increases, the weight for the second
modality decreases, while the corresponding weights for the
other, unperturbed modalities remain almost constant.
(Horizontal axis: noise variance.)

To evaluate the robustness of the proposed algorithms,
random noise with varying variance is injected into the test
samples of the seismic sensor 1 measurements. Fig. 7 shows
the CCR obtained using the different methods. The proposed
weighted scheme has the most robust performance in dealing
with noise. It is also seen that the JSRC algorithm performs
better than MTSRC as the noise level increases. One
possible reason is that the assumption in MTSRC of strong
correlation among the three seismic channels is not valid
for large noise values. Fig. 8 shows the averaged modality
weights generated by MTSRC-W. As expected, the weight
of the perturbed modality decreases as the noise level increases.
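
The perturbation used in this robustness test can be sketched as follows. The text only states that random noise of varying variance is added to one modality of the test samples, so the zero-mean Gaussian distribution, the dictionary layout, and the modality names below are assumptions made for illustration.

```python
import random


def add_noise_to_modality(samples, modality, variance, rng):
    """Return a copy of `samples` with noise added to a single modality.

    `samples` maps a modality name to a list of feature vectors.
    Zero-mean Gaussian noise is an assumption of this sketch.
    """
    sigma = variance ** 0.5
    # Copy every modality so the input is left untouched.
    noisy = {name: [vec[:] for vec in vecs] for name, vecs in samples.items()}
    noisy[modality] = [
        [x + rng.gauss(0.0, sigma) for x in vec] for vec in samples[modality]
    ]
    return noisy


# Hypothetical test features for two of the modalities.
samples = {
    "seismic1": [[0.2, 0.5, 0.3], [0.1, 0.6, 0.3]],
    "pir": [[0.4, 0.4, 0.2], [0.3, 0.3, 0.4]],
}
rng = random.Random(0)
noisy = add_noise_to_modality(samples, "seismic1", variance=0.04, rng=rng)
print(noisy["seismic1"])
print(noisy["pir"] == samples["pir"])  # unperturbed modality is unchanged
```

Sweeping `variance` over a range of values and re-running the classifier on each perturbed copy would give one point per noise level on a curve such as the one in Fig. 7.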

5.  Conclusions 

This paper presents a reformulation of tree-structured
sparsity models for the purpose of multimodal classification
and proposes an accelerated proximal gradient method to
solve this class of problems. Tree-structured sparsity allows
extraction of cross-correlated information among multiple
modalities at different granularities. The paper also
presents a possibilistic weighting scheme to jointly represent
and quantify multimodal test samples by using several
sparsity priors. This formulation provides a framework for
robust fusion of the available sources based on their respective
reliability. The results show that the proposed algorithms
achieve state-of-the-art performance on the field data of
three applications: multiview face recognition, multimodal
face recognition, and multimodal target classification.